Real-world performance of SARS-CoV-2 serology tests in the United States, 2020

Background Real-world performance of COVID-19 diagnostic tests under Emergency Use Authorization (EUA) must be assessed. We describe overall trends in the performance of serology tests in the context of real-world implementation. Methods Six health systems estimated the odds of seropositivity and positive percent agreement (PPA) of serology test among people with confirmed SARS-CoV-2 infection by molecular test. In each dataset, we present the odds ratio and PPA, overall and by key clinical, demographic, and practice parameters. Results A total of 15,615 people were observed to have at least one serology test 14–90 days after a positive molecular test for SARS-CoV-2. We observed higher PPA in Hispanic (PPA range: 79–96%) compared to non-Hispanic (60–89%) patients; in those presenting with at least one COVID-19 related symptom (69–93%) as compared to no such symptoms (63–91%); and in inpatient (70–97%) and emergency department (93–99%) compared to outpatient (63–92%) settings across datasets. PPA was highest in those with diabetes (75–94%) and kidney disease (83–95%); and lowest in those with auto-immune conditions or who are immunocompromised (56–93%). The odds ratios (OR) for seropositivity were higher in Hispanics compared to non-Hispanics (OR range: 2.59–3.86), patients with diabetes (1.49–1.56), and obesity (1.63–2.23); and lower in those with immunocompromised or autoimmune conditions (0.25–0.70), as compared to those without those comorbidities. In a subset of three datasets with robust information on serology test name, seven tests were used, two of which were used in multiple settings and met the EUA requirement of PPA ≥87%. Tests performed similarly across datasets. Conclusion Although the EUA requirement was not consistently met, more investigation is needed to understand how serology and molecular tests are used, including indication and protocol fidelity. 
Improved interoperability of test and clinical/demographic data is needed to enable rapid assessment of the real-world performance of in vitro diagnostic tests.


Introduction
Despite the availability of highly effective COVID-19 vaccines to prevent hospitalization and reduce mortality [1,2], variants continue to fuel the surge of COVID-19 across the U.S. [3,4]. High-quality diagnostic and serology tests are essential tools to better understand the epidemiology of COVID-19 and immunity after infection [5,6]. Viruses and antibodies are primarily detectable within certain temporal windows [7][8][9]. However, many individuals infected with SARS-CoV-2 are asymptomatic or may not seek medical care because of mild symptoms [10]. In contrast to molecular diagnostic tests, serologic tests are informative even once the SARS-CoV-2 infection is no longer present [11,12].
Currently, 90 SARS-CoV-2 serology/antibody tests are authorized under Emergency Use Authorization (EUA) [13]. However, because of the COVID-19 national emergency, they have not undergone the evidentiary review required for Food and Drug Administration (FDA) clearance [14,15]. There is a need to assess the real-world performance of these tests. Further, while large studies have shown that greater than 91% of people with active SARS-CoV-2 infection seroconvert [16,17], the factors associated with seroconversion (e.g., pre-existing conditions, the severity of COVID-19 presentation) remain elusive.
From a public health perspective, confidence in the ability of serological tests to identify those with recent infections is critical for effective pandemic planning. Estimates of disease prevalence directly inform dynamic population estimates of susceptible, infected, and recovered, which are needed to understand the infectiousness of SARS-CoV-2 [18]. From a clinical perspective, an accurate understanding of SARS-CoV-2 exposure is necessary to understand disease presentation and a clinical course of action, especially when patients do not present with symptoms or present late in their disease course (e.g., post-acute sequelae of SARS-CoV-2). Additionally, identifying factors associated with seropositivity may elucidate potential mechanisms of action that may be foundational in the development of therapy and treatment plans.
To address these gaps, we characterize the performance of serology tests by estimating the positive percent agreement (PPA) of serological samples obtained from people known to be positive for SARS-CoV-2 infection by molecular assay (e.g., PCR). We also sought to identify factors associated with seropositivity. Findings from this study may facilitate understanding of the real-world performance of serology tests, many of which were issued under EUA, and may help inform our understanding of the immune response to SARS-CoV-2.

Study population and setting
Six health systems (i.e., datasets) collaborated on the Diagnostics Evidence Accelerator (EA): Health Catalyst, Mayo Clinic, Optum Labs, Regenstrief Institute, the University of California Health System, and Aetion and HealthVerity. The EA is a consortium of leading experts in health systems research, regulatory science, data science, and epidemiology, specifically assembled to analyze health system data to address key questions related to COVID-19. The EA provides a platform for rapid learning and research using a common analytic plan. Health Catalyst, Mayo Clinic, and the University of California Health System all utilized electronic health records (EHR) data from their respective healthcare delivery systems. The Regenstrief Institute accessed EHR and public health data from the Indiana Health Information Exchange [19,20], while Aetion sourced healthcare data from HealthVerity Marketplace encompassing medical claims, pharmacy claims, hospital chargemaster, and data collected directly from laboratories. Optum Labs data included de-identified medical and pharmacy claims from a single, large U.S. insurer as well as laboratory results data obtained directly from laboratories. We refer to these health systems as datasets A-F for the purposes of anonymity. Data sources included in the analysis are generally categorized as either payer (claims) or healthcare delivery systems. As illustrated in Fig 1, data were drawn from across the U.S. with heavy representation in California, Illinois, Ohio, and Michigan. Characteristics of participating data sources and representative populations are described in the S1 Table.

Study design
In this retrospective cohort study, we identified patients across different settings (e.g., inpatient, outpatient, emergency department (ED), or long-term care facility) who tested positive for SARS-CoV-2 ribonucleic acid (RNA) by molecular test between March and September 2020 and who received at least one subsequent serological test for SARS-CoV-2 immunoglobulin (Ig) G or total antibody (Ab) 14-90 days after the positive RNA test (Fig 2). We analyzed the first serology test in the 14-90-day follow-up period, which ended on December 31, 2020. "Date of RNA positive" served as the index (cohort entry) date and was defined hierarchically as the date of 1) sample collection; 2) accession; or 3) result. Because the optimal time to observe a positive serology result is at least two weeks after the index date, we included only patients who had at least one serology test 14-90 days after the index date [1][2][3][7][8][9].
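The index-date hierarchy and the 14-90-day eligibility window described above can be sketched as follows. This is a minimal illustration in Python, not the study's code: the function names `index_date` and `first_eligible_serology` are hypothetical, and treating both window bounds as inclusive is an assumption.

```python
from datetime import date
from typing import Optional, Sequence


def index_date(collection: Optional[date],
               accession: Optional[date],
               result: Optional[date]) -> Optional[date]:
    """Return the first available date in the order collection, accession, result."""
    for candidate in (collection, accession, result):
        if candidate is not None:
            return candidate
    return None


def first_eligible_serology(index: date,
                            serology_dates: Sequence[date],
                            min_days: int = 14,
                            max_days: int = 90) -> Optional[date]:
    """Return the earliest serology date inside the follow-up window, if any.

    Assumption: both bounds (day 14 and day 90) are inclusive.
    """
    eligible = sorted(
        d for d in serology_dates
        if min_days <= (d - index).days <= max_days
    )
    return eligible[0] if eligible else None


# Hypothetical patient: no collection date recorded, so accession date is used.
idx = index_date(None, date(2020, 4, 1), date(2020, 4, 3))
first = first_eligible_serology(
    idx, [date(2020, 4, 10), date(2020, 5, 1), date(2020, 4, 20)]
)
```

Here the test on April 10 falls 9 days after the index date and is ignored, so the analyzed test would be the one on April 20.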
To minimize the effect of differential missingness between datasets, we applied the following rules: 1) included all persons with an office or telephone visit in the +/-14 days around the index date to enable as complete an assessment of presenting symptoms as possible; 2) in claims systems, included only persons with at least six months of enrollment in the year before index; 3) estimated the proportion of patients at each site who had zero encounters in the prior year to contextualize our capture of pre-existing conditions; and 4) excluded variables from analysis if ≥30% of values were missing.
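The variable-exclusion rule in item 4 could be applied along these lines. This is a sketch under stated assumptions: the 30% threshold comes from the text, but the helper names, the use of `None` to mark a missing value, and the choice to drop a variable that sits exactly at the threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def missing_fraction(values: Sequence[Optional[object]]) -> float:
    """Fraction of entries that are missing (represented here as None)."""
    if not values:
        raise ValueError("cannot assess missingness of an empty variable")
    return sum(v is None for v in values) / len(values)


def variables_to_keep(columns: Dict[str, Sequence[Optional[object]]],
                      threshold: float = 0.30) -> List[str]:
    """Keep variables whose missing fraction is below the threshold.

    Assumption: a variable exactly at the threshold is excluded.
    """
    return [name for name, values in columns.items()
            if missing_fraction(values) < threshold]


# Hypothetical extract with four patients.
example = {
    "age_group": ["45-64", "18-44", None, "65+"],  # 25% missing, kept
    "race": [None, None, "White", None],           # 75% missing, dropped
}
kept = variables_to_keep(example)
```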
The Western-Copernicus Group (WCG) Institutional Review Board (IRB), the IRB of record for the Reagan-Udall Foundation for the FDA, reviewed the study and determined it to be non-human subjects research. Additionally, all legal and ethical approvals for use of the data included in this study were submitted, reviewed, and/or obtained locally at each contributing dataset by an IRB and/or governing board.
Covariates. We collected demographic, behavioral, and environmental characteristics, baseline clinical presentation, key comorbidities, and test characteristics, including manufacturer, according to a diagram illustrating potential factors associated with serology testing (Fig 3). We identified comorbidities and clinical presentation using phenotypes defined by International Classification of Diseases, 10th Revision (ICD-10) codes and/or National Drug Codes. We identified comorbidities (pre-existing conditions) from 365 days through 15 days before the index date. We provided coding algorithms for groups to use, while some groups used existing algorithms generated by their site. The ICD-10 codes used to identify comorbidities are listed in the S2 Table. We also stratified analyses by whether the RNA test was conducted before June 15, 2020, which marked the beginning of the summer wave of infections in the first year of the pandemic, or on or after that date.
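The baseline look-back for comorbidities (365 days through 15 days before the index date) can be illustrated with a short sketch. The function names are hypothetical, the window bounds are assumed to be inclusive, and the `E11` prefix (type 2 diabetes in ICD-10) is used only as an example; the study's actual code lists are those in the S2 Table.

```python
from datetime import date
from typing import Iterable, Sequence, Tuple


def in_baseline_window(diagnosis_date: date, index: date,
                       start_days: int = 365, end_days: int = 15) -> bool:
    """True if the diagnosis falls 15 to 365 days before the index date.

    Assumption: both ends of the window are inclusive.
    """
    days_before = (index - diagnosis_date).days
    return end_days <= days_before <= start_days


def has_comorbidity(diagnoses: Iterable[Tuple[str, date]], index: date,
                    code_prefixes: Sequence[str]) -> bool:
    """True if any diagnosis code matches a prefix and lies in the window."""
    prefixes = tuple(code_prefixes)
    return any(code.startswith(prefixes) and in_baseline_window(when, index)
               for code, when in diagnoses)


# Hypothetical patient record: (ICD-10 code, diagnosis date) pairs.
index = date(2020, 6, 1)
record = [("E11.9", date(2020, 1, 15)), ("I10", date(2020, 5, 28))]
diabetes = has_comorbidity(record, index, ["E11"])
hypertension = has_comorbidity(record, index, ["I10"])
```

In this example the diabetes code is counted because it was recorded 138 days before the index date, while the hypertension code, recorded 4 days before, falls outside the baseline window.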

Statistical analysis
Each contributing dataset ran its analysis according to a common protocol. Results were reviewed as a group to ensure alignment with the protocol and to review any protocol deviations. We calculated PPA as: (Number of positive antibody results ÷ Number of positive RNA results) × 100. We calculated PPA based on the first eligible serology test in the follow-up period overall and by age, sex, race, ethnicity, U.S. region, pregnancy status, pre-existing conditions, including but not limited to cardiovascular disease, obesity, hypertension, kidney disease, asthma, dementia, chronic liver disease, and smoking status. We also report the PPA by presenting symptoms and by serology test at the time of the first serology test. We examined variations in PPA by serology tests and time, and serology tests and symptom presentation. We also examined variations in PPA by geography and care setting over time. We calculated exact (Clopper-Pearson) 95% confidence intervals (CI). We report differences as significant where the 95% CIs are completely separated, although we did not conduct formal statistical comparisons of PPA between groups.
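As an illustration of the PPA calculation, the exact (Clopper-Pearson) interval, and the "complete separation" rule described above, the following Python sketch may help. It is not the study's code (analyses were run in SAS or the Aetion Evidence Platform): the function names are hypothetical, and the interval is found here by bisection on the binomial distribution, which is only one of several equivalent ways to compute it.

```python
import math
from typing import Callable, Tuple


def ppa(positive_serology: int, positive_rna: int) -> float:
    """Positive percent agreement, expressed as a percentage."""
    if positive_rna <= 0:
        raise ValueError("need at least one RNA-positive person")
    return 100.0 * positive_serology / positive_rna


def _binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p); terms are computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def _solve_decreasing(f: Callable[[float], float], target: float) -> float:
    """Bisection for f(p) = target on [0, 1], where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided CI for a proportion x/n, returned on the 0-1 scale."""
    if not 0 <= x <= n or n <= 0:
        raise ValueError("require 0 <= x <= n and n > 0")
    # Lower bound: P(X >= x | p) = alpha/2, i.e. CDF(x - 1) = 1 - alpha/2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: _binom_cdf(x - 1, n, p), 1.0 - alpha / 2.0)
    # Upper bound: P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: _binom_cdf(x, n, p), alpha / 2.0)
    return lower, upper


def intervals_separated(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two confidence intervals do not overlap at all."""
    return a[1] < b[0] or b[1] < a[0]


# Hypothetical counts: 87 seropositive among 100 RNA-positive people.
estimate = ppa(87, 100)
ci_low, ci_high = clopper_pearson(87, 100)
```

Non-overlap of two 95% intervals is a conservative screen rather than a formal test, which is consistent with the caveat that no formal comparisons between groups were conducted.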
To identify independent factors associated with seropositivity, we estimated a model assuming a binomial distribution for seropositivity status. Results are presented as odds ratios (OR) with 95% CIs calculated using score or exact methods [33]. All variables were treated as categorical. Symptoms were reported as a binary variable: "1" if any of the following symptoms were present: fever >100.4°F, abnormal chest imaging finding, high respiratory rate, low blood pressure, diarrhea, hypoglycemia, chest pain, delirium/confusion, headache, sore throat, cough, shortness of breath, pneumonia, acute respiratory infection, acute respiratory distress, cardiovascular presentation, renal presentation; and "0" otherwise. For datasets with data covering >1 geographic catchment area, geography was included as either one of four U.S. Census regions, or nine U.S. Census divisions based on patient home zip code. Variables with >30% missing/unknown values were excluded from models (except for pregnancy, pre-existing condition, or presenting symptoms, all of which were included). Each dataset used automated backward selection to remove non-significant pre-existing conditions while forcing all other covariates into the model. All analyses were performed using SAS software, version 9.2 or higher (SAS Institute, North Carolina, U.S.); or the Aetion Evidence Platform v4.13 (including R v3.4.2), which includes audit trails of all transformations of raw data and a quality check of the data ingestion process.
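To make the odds ratio concrete, the sketch below computes an unadjusted OR from a 2×2 table with a Wald 95% CI. This is a deliberate simplification and not the method used in the study, which fit adjusted models and reported score or exact intervals; the function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_wald(exposed_pos: int, exposed_neg: int,
                    unexposed_pos: int, unexposed_neg: int,
                    z: float = 1.959964) -> Tuple[float, float, float]:
    """Unadjusted odds ratio with a Wald CI computed on the log scale.

    Returns (odds_ratio, lower, upper). All four cells must be positive;
    z defaults to the normal quantile for a two-sided 95% interval.
    """
    cells = (exposed_pos, exposed_neg, unexposed_pos, unexposed_neg)
    if min(cells) <= 0:
        raise ValueError("all cells must be positive for a Wald interval")
    odds_ratio = (exposed_pos * unexposed_neg) / (exposed_neg * unexposed_pos)
    se_log = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: 90 of 100 seropositive in one group, 75 of 100 in the other.
or_hat, or_low, or_high = odds_ratio_wald(90, 10, 75, 25)
```

With these counts the odds are 9 in the first group and 3 in the second, giving an OR of 3. An adjusted estimate from a multivariable model can differ substantially from this crude value.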

Results
Sample sizes across datasets ranged from 660 to 7,115; a total of 15,615 people with at least one serology test 14-90 days after the index date were included in the analyses. Between 35% and 65% of patients identified from health care delivery systems had no documented encounter in the system between 365 and 15 days before the index date. In contrast, only 11% of patients from national insurers had zero claims in the baseline period. As shown in Table 1, the serotested population was primarily 45-64 years of age (>40%), with a history of cardiovascular disease, including hypertension (8-70%). Race and ethnicity data were robust (<30% missing) in four datasets. The serotested population in those datasets was primarily White (>53%) and non-Hispanic (>65%). In datasets with national representation, persons from the Northeast (New England and Mid-Atlantic) were most represented in this serotested population. In datasets that represent regionally-based healthcare delivery systems, the population reflected their locations: Pacific and Midwest. Information on manufacturer test names was provided in four datasets. Generally, 2-3 primary tests were utilized in each dataset; 4 of 7 tests reported were used in >1 dataset. We did not observe any difference by age or sex between those for whom the test name was known and those for whom it was unknown. In a single dataset with <30% of missing data on race/ethnicity, we observed over-representation of White and Hispanic people among those for whom the test name was known.

Positive percent agreement (PPA) of serology among molecularly confirmed SARS-CoV-2
The overall PPA ranged from 65-90% across analytic datasets (Table 2). The real-world PPA met the EUA requirement of ≥87% in three datasets (A, B, D) [34]. Two of these datasets represented national administrative claims and associated results with the date the sample was collected or received by the laboratory; the third represented data from EHRs and associated results with the date the test was conducted, which is lagged further from the clinical interaction than the former. Overall PPA was likely influenced by the mix of serology tests represented in each dataset. Seven serological tests were reported in this analysis, of which two (Δ and Γ) met the EUA PPA requirements. Two tests were used across multiple datasets and performed similarly above the EUA requirement. PPA by serology test type varied across datasets, with three of five reporting significantly lower PPA from total antibody (PPA range: 69-90%) compared to IgG (PPA range: 87-92%), and two showing no difference. We observed no difference in PPA between antibody tests that target spike proteins and those that target nucleocapsid proteins.
PPA was significantly higher in Black (PPA range: 86-92%), as compared to White (PPA range: 78-86%), persons in at least two of the four datasets reporting robust race/ethnicity data. PPA was significantly higher in Hispanic (PPA range: 79-96%), compared to non-Hispanic (PPA range: 60-86%), patients. PPA appeared highest in those with diabetes (PPA range: 75-94%) and kidney disease (PPA range: 75-95%), and lowest in those with conditions that leave them immunocompromised (PPA range: 56-93%). We observed higher PPA in the inpatient (PPA range: 70-97%) or ED (PPA range: 93-99%) setting compared to outpatient (PPA range: 63-92%). There was some evidence of higher PPA among patients with at least one COVID-19 related symptom as compared to those with none (PPA range: 63-91%) in two datasets (B and D); PPA was particularly high for select conditions like pneumonia (PPA range: 82-97%). However, differences in the PPA by the presence of symptoms do not appear to be explained by the test. A stratified analysis by test comparing those with and without symptoms (Table 3) showed no significant difference in PPA. PPA trends by calendar time were not consistent across datasets.

Factors associated with seropositivity
In adjusted models (Figs 4-9), the OR for seropositivity was significantly elevated in Hispanic compared to non-Hispanic ethnicity (OR range: 2.59-3.86); among those with pre-existing diabetes (OR range: 1.49-1.56) and obesity (OR range: 1.63-2.23) as compared to those without pre-existing conditions; and among those observed in the ED compared to outpatient (OR range: 2.49-10.97). The OR for seropositivity was significantly lower in those with pre-existing immunocompromised or autoimmune conditions compared to those without such conditions (OR range: 0.25-0.70). In two of three datasets that included pre-existing cardiovascular disease in the OR model, the OR for seropositivity was significantly lower in persons with, compared to those without, such conditions (OR range: 0.49-0.57). The OR for seropositivity tended to be lower on or after June 15 compared to prior in half the datasets, but differences were not significant in the other half.

Discussion
Serology tests are an important instrument in the toolkit to understand the epidemiology of COVID-19 because of their ability to identify persons with prior infection who may present too late in the infectious period due to mild symptoms, or no symptoms at all. Serology results may inform diagnoses of post-acute sequelae of SARS-CoV-2 (PASC) and the appropriate treatment course, which may depend on whether patients are at increased risk for severe illness due to insufficient antibody response [35]. The reported sensitivities of the serology tests included in this analysis that were submitted for EUA were all >95% [36]. Our analysis of multiple large datasets of patients with confirmed SARS-CoV-2 infection suggests that serology tests performed lower than expected, with PPA (a measure analogous to sensitivity) ranging from 65-90%. Our results align with results from smaller, detailed laboratory evaluations that suggest a lack of harmonization, including optimization of cut-off values, may contribute to decreased overall performance. Additionally, our results align with studies that include more representative samples of milder or asymptomatic persons [37][38][39]. Two of seven tests reported across datasets achieved the EUA requirement of PPA ≥ 87%. As we did not have data on specific serology-molecular pairs or meta-information on the tests (including fidelity to protocols for serology and molecular test analysis), these results reflect more on the real-world implementation of the tests than on the true quality of the tests. Specifically, where the same test was used across multiple datasets, it performed similarly in each. For example, the serology test Γ performed similarly high (PPA >90%) across three datasets. However, the overall PPA for tests performed in datasets A and B was higher than in dataset E. A major factor that may have contributed to this difference is that the other serological tests reported to datasets A and B performed above the EUA requirement. In contrast, the other tests reported in dataset E performed below the EUA requirement. Additionally, datasets A and B leveraged administrative claims data and associated RNA and serology results with sample collection or sample receipt date, while dataset E associated results with the date the test was run. Dataset E also represents those from a healthcare delivery system where serology tests were initially only used for symptomatic patients with at least 12 days of symptoms. This practice shifted after approximately two months (June 1, 2020) to a protocol that required both molecular and serological testing for SARS-CoV-2 as part of pre-procedure screening. This protocol was in effect for another three months (August 31, 2020), after which the healthcare system shifted to unrestricted testing for both molecular and serology tests and saw a substantial drop in the use of serological testing. We expected that procedural "lags" to serotesting, combined with additional lags due to associating results with a date downstream from the clinical interaction, may have further extended the time between infection/symptom onset and the actual time of serology sampling. The impact of this misclassification may be most important for serology samples at the upper bound of 90 days, where samples were likely >90 days from the point of infection and humoral antibodies were more likely to have declined. Despite changes in the protocol over time, we observed no overall or test-specific difference in PPA before or since June 15, 2020 in dataset E. Nevertheless, administrative protocols create lags in serotesting that challenge our assumptions of whether the observed molecular "test date" is a good proxy for symptom onset. Without knowledge of the established clinical protocols, it is difficult to make broad assumptions regarding patterns in molecular or serology testing.
We observed that Hispanic patients (compared to non-Hispanic patients), patients with pre-existing obesity, and those who presented in the ED had higher odds of seropositivity and similarly higher PPA. These results further support what others have observed: persons with unmanaged diabetes, who are disproportionately people of color, are vulnerable to hyper-inflammation related to COVID-19 [40]. Furthermore, hyper-inflammation, including pro-inflammatory cytokine storm, has been associated with severe disease, reduced viral clearance [41], and sustained antibody production [42]. However, a recent small study showed that, while a low viral load is associated with lower antibody response, clinical illness does not guarantee seroconversion [43]. Other studies have demonstrated that people with cancer have a lower probability of mounting an immune response from the vaccine, as demonstrated by seroconversion, viral neutralization, and T-cell response [44,45]. Our results demonstrating lower odds of seropositivity among those with cancer and other immunodeficiencies suggest that the same may be true regarding their antibody response to infection.

Strengths
Our study has many strengths. This was a large assessment of serotesting across the U.S. in diverse datasets leveraging either EHR or claims data. We developed a protocol that incorporated the unique characteristics of each data source and provided a forum to transparently communicate and collaborate on study design and interpretation. We also established a platform to rapidly collect and analyze data from various systems, which may be used to evaluate process improvement, identify important trends over time, and make comparisons within data systems. We extensively characterized missing data to guide model development and help with interpretation. Additionally, this study was conducted before public availability of COVID-19 vaccines across the U.S., which minimizes the potential for confounding related to vaccine-induced antibodies.

Limitations
A major limitation of this real-world analysis is the large number of missing test names and relevant meta-data, including quality control measures adopted, for both molecular and serological tests. As such, we were unable to account for molecular-serology pairs when assessing PPA or the fidelity with which these tests were performed. The large amount of missing test name information limited our ability to describe trends by manufacturer, although a thorough examination of missing data does not suggest differential missingness by age or sex. Importantly, the intent of this analysis was not to evaluate individual tests, but the performance of serology in the context of real-world implementation of test protocols and varying reference standards. As discussed in our prior manuscript, the sample included in this study included those who were more likely to be serotested for SARS-CoV-2: White, 45-64 years of age, with prior history of cardiovascular disease. Nevertheless, there was still a sufficiently large number of people to assess PPA trends among younger ages and in those with and without other pre-existing conditions. Finally, this study was conducted before the surge of the Omicron variant, which has been shown to have a number of mutations on the N-gene and S-gene that reduce the sensitivity of some diagnostic tests [46]. As such, our inference is limited to the SARS-CoV-2 variants prior to Omicron, primarily alpha.

Conclusion
Across large samples of patients with molecularly confirmed SARS-CoV-2, serology tests did not consistently meet the EUA requirement of PPA ≥ 87% in the post-market setting. However, given the limited availability of test names, this analysis serves as a signal that further investigation into how serology and molecular tests are used, including protocol fidelity, is needed to understand ways to improve the real-world performance of serology tests.
Despite differences in testing protocols and data availability, the similarity in performance of serology tests across datasets suggests that serology tests were robust to differences in care settings. However, the real-world PPA for several serology tests did not meet EUA requirements, and the exclusive representation and low use of such tests in certain datasets appear to have affected the overall performance of serology tests in those datasets. Where data were sufficiently robust, we observed that people of Hispanic ethnicity had higher odds of seropositivity than non-Hispanics. Higher odds of seropositivity in those with pre-existing diabetes or obesity further support the hypothesis that these conditions are associated with more severe disease, reduced viral clearance, and the sustained presence of antibodies. Conversely, lower odds of seropositivity among those with cancer and other immunodeficiencies suggest that the impaired immune response to vaccination observed in these groups may extend to infection.
Interpreting results from real-world data collected from clinical and administrative databases is challenging. A clear understanding of testing protocols at the point of care is needed to validate assumptions regarding proxy variables and to interpret results. Incomplete information on race/ethnicity and test name limited our ability to address racial disparities in testing and real-world performance of serological tests. Nevertheless, implementing best practices for analyzing and reporting results from observational data across multiple datasets yields confidence in trends that are repeated. Where results diverged, we were able to explore how differences in data sources may explain findings and to target areas for future investigation. Improved data interoperability to link test names and clinical/demographic data is critical to enable rapid assessment of the real-world performance of in vitro diagnostic tests, particularly in the face of fast-mutating pathogens.